Skip to content

WIP: Dict Array Extension #29557

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from
Closed

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Nov 11, 2019

modeled after some of @TomAugspurger great work in #27949 here is a dirty implementation of a DictArray. Marking as a WIP as still need to work through all of the remaining extension array types, though figure can put out there for feedback if anyone cares in the meantime

Definitely some overlap with the existing JSONArray; could integrate some aspects of that into this or vice versa

@WillAyd WillAyd added the ExtensionArray Extending pandas with custom dtypes or arrays. label Nov 11, 2019
@jbrockmendel
Copy link
Member

Little bit of a wet blanket, but... why? nested objects are among the corneriest of corner cases; i would expect we'd want to discourage them. Or is the idea to stuff all of the corner stuff into a couple of EAs?

@WillAyd
Copy link
Member Author

WillAyd commented Nov 12, 2019 via email


def __setitem__(self, key: Hashable, value: Any):
def check_value(value):
if not (pd.isnull(value) or isinstance(value, dict)):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could potentially make this pretty class pretty generic adding an abstract class for all Container types then leave it to subclasses for a few minor things to clarify list vs set vs dict

raise ValueError(
"DictArray requires a sequence of dicts. Got "
"'{}' dtype instead.".format(self._ndarray.dtype)
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need to extend this validation for object dtypes on construction that contain more than just dicts...

@jorisvandenbossche
Copy link
Member

Question about the scope of this DictArray: in general, I think there are two possible cases: fixed keys and variable keys.

In Arrow terminology, this is called StructArray (fixed set of keys in each record) or MapArray (variable set of keys in each record). Those are regarded separately, because they are stored in memory quite differently (the StructArray stores one contiguous array of values for each key per column and can have different types for the values of the different keys, while MapArray requires all values to be of the same type and stores them in one single array with offsets, and also the keys in one contiguous array). Also Spark makes the same distinction.

A python dict is more close to the map type (but also with variable types for the values), but just wondering to make the specification explicit.

@TomAugspurger
Copy link
Contributor

A bit more on scope / names: postgres has a JSON data type that basically combines nested data support for lists and mappings. Would we want that, or is there value in separating lists from mappings at a dtype level?

If we split lists and dicts into their own extension dtypes, I'd vote to call this a MappingArray rather than a DictArray. It's a bit more general, and I think at some point we'll have a "categorical without fixed categories" array, which pyarrow calls a dictionary encoded array. I'm not sure we would call that a DictEncodedArray, but just in case.

@jreback
Copy link
Contributor

jreback commented Nov 12, 2019

yeah I think we would be well served by the distinct types that @TomAugspurger indicates a nested type that holds dicts, lists, scalars and a DictEncoded type that backs Categorical and String

a pure Dict type is fine for testing

@WillAyd
Copy link
Member Author

WillAyd commented Nov 12, 2019 via email

@TomAugspurger
Copy link
Contributor

downside to a Mapping type is that it by definition wouldn’t support assignment so we’d end up with a MutableMapping array as well or just not care and allow assignment to Mapping.

Being a mapping doesn't mean you're immutable. It just doesn't guarantee that you are mutable, if that makes sense :)

In [3]: import collections.abc

In [4]: isinstance(dict(), collections.abc.Mapping)
Out[4]: True

@WillAyd
Copy link
Member Author

WillAyd commented Nov 12, 2019 via email

@jorisvandenbossche
Copy link
Member

But do we want the mapping elements to be mutable?

Should something like s[0]['a'] = new_value (assigning to a key of the first element (mapping) of the series) be allowed / update the series in place?

In general, IMO we should think a bit more about use cases for those different options. What do other languages do? What kind of types are users looking for?

Putting dicts in an object array is certainly the easiest to get a basic array working, and might be the closest to the objects available in Python, but it doesn't necessarily give what many users might need / want (regarding mutability of values, expanding in place with new keys, variable key and value types, ...). Eg if we want an optimized implementation later, those requirements might be very hard to implement.

@WillAyd
Copy link
Member Author

WillAyd commented Nov 27, 2019

tight on time recently...may come back

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ExtensionArray Extending pandas with custom dtypes or arrays.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants